Puzzles and Patterns in 50 Years of Research on Speech Perception

Author

  • Sarah Hawkins
Abstract

The introduction of speech synthesizers and modern acoustic analysis in the mid-20th century allowed speech perception research to flourish, encouraging a period of broadly-based empirical work. Experiments examined diverse influences from information processing (intelligibility, statistical decision-making), biology/psychology (audition, memory, learning, hemispheric dominance), and linguistic-phonetics (local and non-local context, prosody, individual and stylistic variation). From this wealth of data, one dominant theme emerged: the puzzle that we feel we hear stable, or invariant, percepts of words and phonemes despite their enormous articulatory-acoustic variability in different contexts. Theories were developed to account elegantly for the transformation between variable signal and invariant perceptual unit. But what units, and what is elegant? The search focused on motoric or acoustic correlates of abstract phonological/phonetic units, which represent only the information required to differentiate citation-form words. This emphasis on 'early abstraction', while elegant for phonology, cannot explain how natural speech is understood, yet it shaped phonetic and psycholinguistic enquiry, as exemplified by the 'top-down vs. bottom-up' debates. In this period, new data accumulated on old topics (e.g. memory, speaker identity, multi-sensory perception, neuroscience, phonetic indicators of grammar, meaning and discourse function) and new ones (e.g. developmental and comparative perception, cross-linguistic studies), but were sidelined as too puzzling when they did not fit existing theory. However, since the 1990s data seem again to be forcing theoretical change, with significant shifts in the relative importance of older themes blurring distinctions between perceived signal and knowledge, de-emphasizing phonology, and re-emphasizing context. Thus the cyclic pattern between data puzzles and theory repeats.

INTRODUCTION

Public access to the sound spectrograph and subsequent developments in methods of acoustic analysis and synthesis (see companion papers in this volume by Ohala and Fant) made possible a rich period of research on speech perception during which basic investigative methods changed little. However, there have been shifts in the types of questions asked and the types of speech material regarded as informative. Rather than attempt a comprehensive review of the achievements of the entire period, this paper uses selected experiments in a number of different areas to trace the development of ideas about speech perception between the early 1950s and 2004. To a large extent, the theme I highlight is the conceptualization of what is 'signal', or relevant, and what is 'noise', or irrelevant, in speech perception. This translates methodologically into what should be experimentally manipulated versus controlled 'out' of an experiment. In theoretical terms, it translates into issues of invariance and variance, or essence and free random variation.

Three broad periods are distinguishable. The first, lasting very roughly from about 1950 to the mid-60s, can be thought of as a period of 'glorious discovery' characterized by wide-ranging, imaginative enquiry about a huge range of issues.
Many papers written during this period convey the excitement and indeed exuberance that must have accompanied the sense that 'the speech code' was about to be cracked, with the promise of advances in applications that such an achievement would herald. At the same time, two other themes were present from those early days. One was a profound sense of thoughtful disquiet about the apparent complexity of the critical properties that allow the human brain to make sense of the speech signal. The other, stemming partly from the first, and partly, perhaps, from the prevailing scientific ethos, was a disciplined rather than exuberant focus on seeking answers rooted in simplicity of response.

These latter two early threads became dominant in the second main period, lasting from the mid-60s to perhaps the mid-90s. This was a time dominated by theory of a very particular kind: the assumption that there must be an invariant relationship between the speech signal and its percept, and the search for the nature of that invariance. The third period began in the early-to-mid 1990s, and is still probably in its early stages. If it continues to develop along current lines, it marks a return to a broader focus of enquiry, with re-questioning of fundamental issues about the presumed nature of the processes involved in understanding speech. Data that were difficult to account for within the frameworks of the major theories tended to be sidelined, but are now receiving more prominence and being combined with new understanding. These developments require changes in conceptualization of the task goals and the processes involved in speech perception. Some of them are so profound that it seems possible we are experiencing a paradigm shift in speech perception research.

This paper is divided into the three periods, but each section includes references to work from other periods whenever doing so makes a clearer argument. In particular, there are frequent looks to the future in the description of work done in the earlier periods.

EARLY WORK: ABOUT 1950-1965. 'GLORIOUS DISCOVERY'

Early work often looked at effects on the whole signal, but as puzzles arose, and experimental designs and data were examined more closely, attention became increasingly focused on small domains in an effort both to simplify and to clarify. To give a flavor of the broad scope of early research, this section briefly reviews studies in a number of areas of relevance today.

Source separation

A first step to understanding speech is the ability to distinguish a particular signal from other sounds happening simultaneously in the environment. This ability allows the listener to group some noises together as emanating from a single vocal tract, while excluding others. It is particularly important—and indeed difficult—when the competing noises are speech, and for this reason the topic is often engagingly called the 'cocktail-party effect', although it is also called multi-talker perception. In a series of experiments, Cherry (1953), who introduced the term 'cocktail party effect', presented listeners with competing messages, either to both ears or to the two ears separately. The messages were always continuous natural speech, but were of a number of types and degrees of similarity, and were presented in a wide variety of different ways.
The huge wealth of observations Cherry reported included greater accuracy if responses were written (implying that writing helps reduce memory load), loss of meaning in the attended ear during shadowing and of information about the speech in the unattended ear, and the effects of switching the same message between the two ears at different rates. His methods and results are reported in an exuberant and somewhat vague way that might not meet accepted publication standards today, but the paper is nevertheless worth reading, partly for its enthusiasm and inventiveness, but also because it makes many observations that are still relevant today. For example, Cherry notes the roles of memory, attention, and transitional probabilities in the message, as well as distinctions between the speaker and the content of the message, which are reflected in what listeners are aware of under particular monitoring conditions.

In contrast with Cherry's (1953) relatively naturalistic and wide-ranging approach, Broadbent & Ladefoged (1957) adopted a narrower but more explicit and systematic methodology and reporting style to investigate the cocktail party problem. They focused the general question "How can we recognize what one person is saying when others are speaking at the same time?" onto the more specific "How, when hearing two vowels simultaneously, does the listener group the appropriate formants together so that he is aware he is listening to those two particular vowels rather than some other combination of formants?" They reasoned that place theories of hearing, in which signal frequency is resolved on the basilar membrane, should not predict differences between monaural and binaural listening conditions. Their results demonstrated that separate synthetic formants fuse to sound like a single vowel (coming from a single spatial location) only when they have the same f0, regardless of whether they are presented to the same or different ears. They related their results to the periodicity pitch and place theories of hearing under development at the time. Fundamental frequency is still considered basic to auditory grouping of speech today (e.g. Brokx & Nooteboom, 1982; Darwin, 1981, 1997), and, together with other factors, plays an important role in current work on auditory scene analysis (e.g. Bregman, 1990; Darwin, 1997; Cooke & Ellis, 2001; Carlyon et al., 2002).
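To make the fusion result concrete, here is a minimal computational sketch (in Python with numpy) of the kind of stimulus involved: two synthetic formants built on either a shared or a differing fundamental. The resonator design, bandwidths and f0 values are my own illustrative choices, not Broadbent & Ladefoged's specification.

    import numpy as np

    FS = 16000  # sample rate (Hz); all constants here are illustrative

    def impulse_train(f0, dur):
        # One impulse per pitch period: a crude stand-in for voiced excitation.
        x = np.zeros(int(FS * dur))
        x[::int(round(FS / f0))] = 1.0
        return x

    def formant(f0, freq, bw=80.0, dur=0.5):
        # Excite a second-order resonator (a single formant) with the train.
        r = np.exp(-np.pi * bw / FS)
        a1, a2 = 2 * r * np.cos(2 * np.pi * freq / FS), -r * r
        x = impulse_train(f0, dur)
        y = np.zeros_like(x)
        y[0] = x[0]
        y[1] = x[1] + a1 * y[0]
        for n in range(2, len(x)):
            y[n] = x[n] + a1 * y[n - 1] + a2 * y[n - 2]
        return y / np.abs(y).max()

    # Both formants on one f0: on Broadbent & Ladefoged's account, one fused vowel.
    fused = formant(120.0, 500.0) + formant(120.0, 1500.0)
    # Formants on different f0s: they should tend to segregate into two sources.
    split = formant(120.0, 500.0) + formant(135.0, 1500.0)

On the account above, the shared-f0 version should cohere as a single vowel-like sound and the mixed-f0 version as two concurrent sources: the grouping role that f0 still plays in auditory scene analysis.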
Source integration

The other side of source separation is, of course, source integration, or how sensation from different modalities is combined into the percept of a single signal. As early as 1954, Sumby & Pollack demonstrated that audiovisual presentation increases intelligibility of citation-form monosyllables, spondees, and trisyllabic phrases comprising a spondee and a monosyllable. Importantly, they stressed that the visual contribution is greatest when the words to be identified are presented in rather high levels of background noise, and thus that the visual contribution is relative to the available auditory contribution. Audio-visual speech perception has been studied and modeled by Massaro and colleagues throughout the middle period covered by this paper (see Massaro, 1998) and is currently used in many speech technology applications, as well as representing a thriving area of theoretical enquiry.

Sumby & Pollack (1954) also noted that polysyllables are more intelligible than monosyllables in auditory-only presentations, an observation that figures prominently in today's modeling, in the guise, for example, of attention to overall spectral pattern and envelope shape (Klatt, 1979), neighborhoods (Luce et al., 1990), cohorts (Marslen-Wilson, 1987), and so on, as well as in such well-known effects as the perceptual restoration effect (Warren, 1999).

Brain function: cerebral dominance

In the early and middle periods covered by this paper, the way the brain functions during speech production and perception was not much better understood than it was in the late 19th century, when Broca's and Wernicke's areas were first identified and modeled, because brain function was hard to study for ethical reasons. Nevertheless, by the second half of the 20th century, some information could be learned from—or in connection with—clinical examination of patients being considered for brain surgery, primarily to control epileptic seizures. Lesioning the epileptic foci could control the seizures or reduce their severity, but it was important to ensure that the proposed lesions would not result in behaviors that would be unacceptable to the patient. There was thus a brief opportunity to study brain function in such patients while surgery was being planned. In one such study, Kimura (1961a, b) presented groups of six digits, three to each ear, and had patients report whatever they could remember. She concluded that speech is processed more efficiently in the ear that is contralateral to the language-dominant hemisphere, independent of the patient's handedness and of whether the damage due to epilepsy was in the right or the left hemisphere. This study demonstrates the complexities of the auditory pathways and of the concept of cerebral dominance and its relation to speech processing, as opposed to the representation of speech in the cerebral hemispheres. Despite enormous advances in investigative technology and understanding since the 1990s, it is fair to say that complexity still dominates our understanding of speech processing today (see below). As such, Kimura's work was a forerunner of today's fields of cognitive psychology and cognitive neuroscience, as well as of a large number of dichotic listening (or duplex perception) studies within mainstream speech perception in the middle period, many of which are discussed by Liberman & Mattingly (1985).

Memory

Although early behaviorist/associationist work made clear that memory plays an important role in speech processing (for representative work, see Underwood, 1957), this huge literature seems to have had rather little influence on mainstream speech perception research. However, valuable observations relevant to speech were made during this period. For example, Miller (1956) noted that the typical human adult's short-term memory span of just seven plus or minus two items is inadequate for many tasks, but can be increased in a number of ways: by making relative rather than absolute judgments, by increasing the number of dimensions subsumed within a category, and by chunking into larger items. Recoding was thus seen as a process crucial to successful storage and retrieval.
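Miller illustrated recoding with the conversion of binary digits into octal ones, which practiced observers used to stretch their effective span. A two-line sketch of that arithmetic (the digit string itself is arbitrary):

    bits = "101000100111001110"  # 18 binary items: well beyond a 7 +/- 2 span
    octal = [str(int(bits[i:i + 3], 2)) for i in range(0, len(bits), 3)]
    print(len(bits), "items ->", len(octal), "chunks:", "".join(octal))  # 6 chunks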
The importance of memory to speech sound categorization—and its consequent methodological implications—was developed in the middle period, notably by Pisoni (1973 and later), but has gained greater prominence in recent years with increased interest in exemplar (or episodic) memory, as discussed below. Related work at this early time included Lashley's (1951) seminal work on serial order in behavior. This stressed hierarchical structure, and although it probably influenced speech production more than perception, it nevertheless focused thinking, particularly about rhythm and timing.

Context

The work described so far gives a flavor of the wide range of investigations into general influences on speech perception during the 1950s and early 1960s. At the same time, a significant body of work investigated directly or indirectly the influence of various types of context on speech perception. This section addresses early research on context because views about the role of context in speech perception that developed during this early period have arguably shaped thinking up to the present time. The first two examples address speech intelligibility, while later examples cover speech sound identification in various contexts.

Influence of the context of possible stimuli (or responses) on speech intelligibility

Figure 1. Intelligibility of monosyllables as a function of the size of the test vocabulary and the degree of background noise. From Miller, Heise & Lichten (1951).

Miller et al.'s (1951) classic study on the intelligibility of English monosyllables as a function of the size of test vocabulary and degree of background noise shows how listeners' understanding of what responses (i.e. stimuli) are acceptable affects their actual responses. As Figure 1 shows, when the number of possible responses is fairly small, intelligibility is good, even at high relative levels of background noise. But when the response set is as broad as all the possible monosyllables of English, listeners are only about 60% correct even in quite good listening conditions. Miller et al. (1951) did not point out, but were presumably aware, that these conclusions will be affected by the degree of similarity of the stimuli, especially for small vocabularies. The curve for the two-word response set could reflect a pair of digits like two vs. six, whereas a curve for five vs. nine would lie much further to the right.
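The qualitative pattern in Figure 1 falls out of almost any statistical decision model. The toy simulation below (all parameters invented, with no claim to model real words) picks the stored 'word' template nearest to a noise-corrupted token; accuracy falls as the vocabulary grows at a fixed noise level, just as in Miller et al.'s data.

    import numpy as np

    rng = np.random.default_rng(0)

    def percent_correct(vocab_size, noise_sd, dim=10, trials=2000):
        # Ideal-observer toy: choose the template nearest the noisy token.
        templates = rng.normal(size=(vocab_size, dim))
        hits = 0
        for _ in range(trials):
            target = rng.integers(vocab_size)
            token = templates[target] + rng.normal(scale=noise_sd, size=dim)
            hits += ((templates - token) ** 2).sum(axis=1).argmin() == target
        return 100.0 * hits / trials

    for v in (2, 8, 32, 256, 1000):
        print(f"{v:4d} alternatives: {percent_correct(v, 1.5):5.1f}% correct")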
Influence of broad phonetic context on speech intelligibility

Consistent with the prevailing ethos of examining properties of the whole signal, Pickett & Pollack (1963) took a different approach to speech intelligibility. They showed that excerpts from connected speech must be at least 800 ms long to be fully intelligible. This normally represents at least two or three syllables. A valuable and imaginative aspect of Pickett & Pollack's work is their demonstration that the figure of 800 ms holds regardless of speech rate. Faster rates need more syllables to be understood, and slowing the speech down does not help. Their data thus demonstrate the fundamental role of coarticulation and speech style—today often called 'connected speech processes'—in speech perception. Connected speech processes differ at different rates of speech, and the speech can only be reliably understood when enough has been heard to provide a suitable context for interpretation. Whether this fact should be interpreted as noise to be filtered out of the signal, or as potentially informative and hence central to the signal, is a crucial problem that still challenges theoreticians today.

Influence of long-domain preceding context on the interpretation of the current sound

For Pickett & Pollack (1963) to attempt to isolate exactly what in the signal influenced intelligibility would have gone far beyond the scope of a single paper, and we are unable to give an answer for rapid connected speech even today. However, an experiment by Ladefoged & Broadbent (1957) illustrates that there were attempts to isolate particular influencing factors within broad-based contexts. Ladefoged & Broadbent (1957) found that when a synthetic syllable whose formant structure makes it ambiguous between the words bit and bet is played after the precursor sentence Please say what this word is, it is identified as bit when the precursor has a relatively high F1 (380-660 Hz) but as bet when the precursor has a relatively low F1 (200-380 Hz). In other words, listeners judge the identity of current sounds partly on their inherent formant structure (of course) and partly relative to that of the preceding context, which, in this case, can indicate the speaker's vocal-tract length. (The stimuli can be heard at http://www.jladefoged.com/acousticdemos/acoustics/acoustics.html, or, with more discussion, at http://rvl4.ecn.purdue.edu/~malcolm/interval/1997-056/VowelQuality.html.) This experiment is sometimes criticized as not being replicable, especially with better-quality speech (but see Ladefoged, 1987). In the present discussion, this criticism is not relevant. The point is that the context, listening conditions, and available responses can influence listeners' percepts sufficiently for them to hear a single stimulus as two (or more) different words in different contexts. That this may not normally happen probably partly attests to the robustness of natural speech. But that it can happen must be accounted for by any theory of speech perception, just as reversing figure/ground stimuli and other visual illusions are seen as fundamental to any account of visual perception.
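The relational judgment Ladefoged & Broadbent describe can be caricatured in a few lines: classify the ambiguous token's F1 not absolutely, but by its position within the precursor's F1 range. The threshold and frequency values below are invented for illustration; only the direction of the effect follows their report.

    def classify_bit_bet(target_f1, precursor_f1_values):
        # Extrinsic normalization: judge the target against the precursor range.
        lo, hi = min(precursor_f1_values), max(precursor_f1_values)
        rel = (target_f1 - lo) / (hi - lo)    # target's position in that range
        return "bit" if rel < 0.5 else "bet"  # low relative F1 -> /I/, else /E/

    ambiguous = 450.0  # one and the same physical token in both contexts
    print(classify_bit_bet(ambiguous, [380.0, 520.0, 660.0]))  # high-F1 precursor -> 'bit'
    print(classify_bit_bet(ambiguous, [200.0, 290.0, 380.0]))  # low-F1 precursor  -> 'bet'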
Influence of immediate context on interpretation of the current sound

Simultaneously with the broad-based enquiries described above, a significant body of other work was focusing on detailed attributes of the immediate context in determining the identification of particular speech sounds. Much of this pioneering work originated at Haskins Labs. Acoustic cues to the place of articulation of stop consonants received particular attention in the early 1950s because they were observed to be the most complicated and elusive (i.e. surprising) sounds. A large number of papers were produced, from which I have chosen parts of just three, representative on the one hand for their deeply thoughtful nature and the profound sense of disquiet that they convey about the unexpectedly 'encoded' relationship between phonemic category and acoustic signal, and on the other hand for the huge influence they had on theory.

Cooper et al. (1952) reported a wide range of work on the acoustic cues to stop consonants, /m/ and /l/, and vowels. Their experiments on stops were particularly influential. In one, they used the Haskins Pattern Playback to examine the relationship between the center-frequency of a stop burst and the frequencies of seven two-formant vowels. Both formants were steady-state throughout and, importantly, there were no transitions. As shown in Figure 2, they constructed 84 different stimuli by combining each of 12 different burst frequencies with these seven vowels.

Figure 2. Stimulus design for experiments on the relationship between centre-frequency of the burst and formant frequency of the following two-formant vowel, using the Haskins Pattern Playback. A. The frequencies of the twelve bursts of noise appended to each vowel. B. Frequencies of the two formants for each of seven vowels. C. The design of a single stimulus from the 12 x 7 = 84 combinations examined. From Cooper et al. (1952).

Figure 3. Preferred identifications by 30 listeners of the stimuli of Figure 2. Vertical axis: Frequency (Hz), marked as centre frequencies of the 12 bursts, and also applying to the formant frequencies of the seven transitionless two-formant vowels, shown along the horizontal axis. The 'zones' show the burst-vowel combinations for which /p/, /t/, or /k/ responses were dominant; symbol size roughly indicates the extent of dominance. From Cooper et al. (1952).

Listeners' three-alternative forced-choice identification of the stops heard from these bursts and transitionless vowels patterned in a way that led Cooper et al. (1952) to identify the CV syllable as the minimal acoustic unit. As Figure 3 shows, most bursts whose centre-frequencies were significantly higher than F2 in the vowel were identified as /t/; most whose bursts were close to or slightly higher than F2 were heard as /k/; the rest were heard as /p/. In consequence, whether a CV was heard as beginning with /p/ or /k/ depended on the relative frequencies of the burst and the vowel formants. Paired with appropriate vowels, the same burst frequency could produce the percept /pikapu/, i.e. /p/ before /i/ and /u/, but /k/ before /a/.
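The relational nature of the response zones is easy to state procedurally. The sketch below is a deliberate caricature of Figure 3 (its margins are invented stand-ins for Cooper et al.'s hand-drawn zones), but it captures the point that one physical burst yields different percepts with different vowels.

    def stop_percept(burst_hz, vowel_f2_hz):
        # The percept depends on the burst frequency RELATIVE to the vowel's F2.
        if burst_hz > vowel_f2_hz + 1500.0:
            return "t"  # burst far above F2
        if burst_hz >= vowel_f2_hz - 200.0:
            return "k"  # burst near or somewhat above F2
        return "p"      # burst well below F2

    print(stop_percept(1440.0, 2300.0))  # same burst, /i/-like vowel -> 'p'
    print(stop_percept(1440.0, 1100.0))  # same burst, /a/-like vowel -> 'k'
    print(stop_percept(4000.0, 2300.0))  # high burst, /i/-like vowel -> 't'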
While the details of the data have limited relevance to perception of real speech, for example because the third formant was not included, there were no transitions, and the Pattern Playback produced stimuli that were inherently unsatisfactory in a number of ways, this study is nevertheless remarkable. First, despite the unnatural stimuli, it established as critical the relationship between burst frequency and F2 (and possibly higher formants), and thus the dynamic and relational basis of perceived phonemic categories: "in other words, the perception of these stimuli, and also, perhaps, of their spoken counterparts, requires the consonant-vowel combination as a minimal acoustic unit" (1952:598). Second, because this experiment made explicit the puzzle that a stop consonant requires a vowel to be heard, it raised the question of the status of the vocalic transitions, which began to be explored in this and subsequent papers. Third, the thinking in the rest of the paper is clearly leading towards a dynamically-specified, relational theory, though it is obvious that it was extraordinarily hard work: like many other papers in this early period, the figures are much more complicated than is typical of the middle period, presumably because investigators tried to capture multidimensional properties of speech sounds before ideas had crystallized into a commonly-accepted (and often over-simplified) format. Fourth, the admirable degree of speculation already stresses not just relational rather than absolute values, but also binary decisions about independent features, perhaps influenced by, but stated to be different from, the thinking behind the features of Jakobson et al. (1952). Fifth, the search for simplicity, so characteristic of the middle period, is already evident: Cooper et al. hoped to find that "not more than 2 or 3 cues" (1952:603) would be necessary to produce highly intelligible synthetic speech, and they also noted that intelligibility and naturalness are not the same. Engagingly, they expected that such synthetic speech might be more resistant to noise than natural speech: this is one of the few points this paper makes that turns out not to be true. Finally, this paper anticipated trading relations in speech perception, a subject of much research at Haskins Labs in the middle period, and possibly Keyser and Stevens' (1994, 2001) enhanced features: "while it is clear that bursts and transitions complement each other in the sense that when one cue is weak the other is usually strong, nevertheless, there may remain some syllables for which both cues together may not suffice, and one must then search for other cues." (1952:603).

Soon afterwards, Delattre et al. (1955) identified a systematic relationship between the pattern of F2 transition in burstless two-formant vowels and the percept of place of stop articulation, in CVs with a number of different vowels. Their stimuli are shown in Figure 4, with the stop percept associated with each pattern to the right of each panel. Broadly, stimuli with rising F2 were heard as /b/, those with F2 seemingly emanating from 1800 Hz were heard as /d/, and those with high falling transitions and front vowels were heard as /g/. The alveolar locus frequency of 1800 Hz works remarkably well in practice, but the other locus frequencies identified are specific to the particular stimuli, especially for the velars, for which the third formant is essential.

Curiously, the identification of particular acoustic loci marks a step towards the assumption that the perceptual system abstracts from the signal to (mathematical) functions that are defined in relational terms, yet it also seems to suggest an emphasis on absolute frequencies rather than the relative properties stressed by Cooper et al. (1952). However, the concept of loci in fact retains dynamical and hence relational attributes, in that a locus is essentially the particular frequency or frequency range of the highest-amplitude spectral prominence at stop release, and the shape of the following transition reflects the change in spectral shape during the first few milliseconds after the release. This relationship prefigures the main principles behind Stevens' investigations of acoustic invariants for distinctive features, described below. Acoustic loci have been used in rule-based speech synthesis systems, e.g. those influenced by Dennis Klatt, and are still the subject of research today (e.g. Sussman et al., 1998).

Figure 4. Two-formant Pattern-Playback stimuli used by Delattre et al. (1955) to show that the pattern of F2 transition can determine the percept of place of stop articulation. The F2 transitions were related to specific acoustic loci for each place of articulation.
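A locus can likewise be written as a simple generative rule. In the sketch below, the visible F2 onset for an alveolar starts part-way between the 1800 Hz locus and the vowel's F2 target; k = 0.5 echoes the common textbook summary that transitions 'point at' the locus without starting there, and the vowel F2 values are illustrative rather than Delattre et al.'s.

    ALVEOLAR_LOCUS = 1800.0  # Hz, the alveolar locus of Delattre et al. (1955)

    def f2_onset(vowel_f2, locus=ALVEOLAR_LOCUS, k=0.5):
        # Transition starts part-way from the locus towards the vowel target.
        return locus + k * (vowel_f2 - locus)

    for vowel, f2 in (("i", 2300.0), ("e", 2000.0), ("a", 1100.0), ("u", 800.0)):
        print(f"/d{vowel}/: F2 onset {f2_onset(f2):4.0f} Hz -> target {f2:4.0f} Hz")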
This experiment is important because it focused attention onto transitions alone, rather than the burst + transition of naturally-spoken CV syllables. This had profound practical and theoretical consequences. A practical consequence was that, with other Haskins papers, it was probably influential in encouraging the use of very unnatural stimuli, and a tendency not to worry about the poor quality of the resultant percepts. This in turn may have led to a widespread tendency not to be vigilant about the phonetic quality of experimental stimuli claimed to represent particular phonemes, a problem that is endemic in the literature and becomes a point of contention from time to time. The immediate theoretical consequence of focusing attention on transitions was the discovery of categorical perception of stop consonants: that equal changes along some acoustic dimension lead to unequal percepts. As illustrated in Figure 5, Liberman et al. (1957) used stimuli like those of Delattre et al. (1955) to demonstrate categorical perception for place of articulation of voiced stops by changing the extent and direction of the F2 transition in small, equal-sized steps. As every textbook on speech describes, identification functions showed abrupt crossovers from (near-)100% identification of one phoneme to (near-)100% identification of another, together with enhanced discrimination for pairs of stimuli that crossed the identification boundary, and relatively poor discrimination of within-category pairs that both received the same phoneme label.

Figure 5. Two-formant Pattern-Playback stimuli used by Liberman et al. (1957) in an experiment designed to study the F2 formant transition as a cue to the identification of stop consonants as /b, d, g/. F1 was constant for all stimuli. The onset frequency of F2 changed in equal acoustic steps (Hz) throughout the series. The experiment involved the discovery of categorical perception.
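The hallmark of categorical perception is that discrimination is predictable from labeling alone. One classic formulation of the 'Haskins' prediction for two response categories is P(correct) = 0.5 + 0.5(p1 - p2)^2, where p1 and p2 are the probabilities of giving one label to each member of the pair. The sketch below applies it to an invented identification function, yielding chance-level discrimination within categories and a peak at the boundary.

    import numpy as np

    steps = np.arange(1, 15)  # stimulus continuum, e.g. F2-onset steps
    p_d = 1 / (1 + np.exp(-1.8 * (steps - 7.5)))  # invented P(label = /d/)

    for i in range(len(steps) - 1):
        pc = 0.5 + 0.5 * (p_d[i] - p_d[i + 1]) ** 2  # adjacent-pair ABX prediction
        print(f"pair {steps[i]:2d}-{steps[i + 1]:2d}: {100 * pc:4.1f}% predicted correct")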
The discovery of categorical perception of obstruent consonants, together with a theoretical bias in favor of binary oppositions typical of the time, encouraged a focused search for simple transformations from the encoded signal to an unambiguous, formal linguistic mental representation. This narrower focus required clear conceptualization of the identity of the important unit(s) of speech perception, and of the process of abstraction envisaged. On the whole, the units and levels of linguistic description were rather uncritically adopted, though not without some misgivings. Cooper et al. wrote: "we ... had undertaken to find the 'invariants' of speech, a term which implies, at least in its simplest interpretation, a one-to-one correspondence between something half-hidden in the spectrogram and the successive phonemes of the message ... one should not expect always to be able to find acoustic invariants for the individual phonemes ... we are trying to [compile] the code book, one in which there is one column for acoustic entries and another column for message units, whether these be phonemes, syllables, words, or whatever." (1952:604-5). The issue of the units of speech perception would not receive intensive attention until the middle period (e.g. Savin & Bever, 1970; McNeill & Lindig, 1973). It is still unresolved today, and in any case, there are probably differences between languages and between tasks. Goldinger & Azuma (2003) offer a brief up-to-date review. Given the conceptual and methodological difficulties, this delay is not surprising: although early investigations sometimes acknowledged doubts about the perceptual relevance of the phoneme, it seems generally to have been felt to be more important to seek simple acoustic correlates of phoneme-like units, and models that reflected prevailing views of theoretical linguistics and information theory, than to seek a radically different approach in terms of mapping sound onto meaning.

Summary: achievements of the early period

The early period was notable both for breadth of enquiry and for a narrower focus on acoustic correlates of phoneme-like chunks of sound. As noted above, many issues that remain unresolved today were raised during this period, even if they were immediately set aside in order to make the current investigation tractable. Towards the end of the period, essentially all the important issues of speech perception were brought together into one comprehensive, brilliantly thoughtful and prescient paper by Halle & Stevens (1962). (See also Stevens & Halle, 1967; because this is a brief historical review, the present discussion is confined to the earlier paper.)

Halle & Stevens (1962) proposed a model of speech recognition based on analysis-by-synthesis that is briefly described in many textbooks. They describe it as a two-stage model for speech processing before phoneme identification, each stage of which involves analysis-by-synthesis. (As phoneme identification is the output of the second stage, to describe the stages as taking place before identification seems open to misinterpretation; however, the proposed processing does occur beforehand.) Speaker identity is eliminated in the first stage, whose output is phonetic parameters describing the relevant vocal-tract movements and source excitation types. The second stage eliminates 'irrelevant' variables like rate of speech, dialect, and contextual variants of phonemes, to give a phoneme string output. This string seems to be equated with message identity, which is of course too simple, as the authors clearly realize. The model is as relevant today as when it was proposed, partly because it proposes that recognition takes place through knowledge-driven comparisons between incoming signal and stored representations of signals that have already been heard. This is similar to models based on exemplar memory that are currently the subject of much controversy. Perhaps even more significant is the range of currently pertinent issues that Halle & Stevens touched upon. In the model itself, these include: tight connections between production and perception, identification of phonetic parameter tracks before attempts to identify phonemes, adaptation to new talkers, distinguishing (or not) between the message and the talker, and a control component that dictates the order in which comparison signals are generated on the basis of such things as the goodness of fit of comparisons already made, the preliminary analyses, and knowledge of phonotactic probabilities. They propose preliminary tentative identification of candidate phonemes from high-certainty signal properties such as excitation type: "For the recognition of continuous speech it may not always be necessary to have recourse to analysis-by-synthesis procedures. A rough preliminary analysis at each of the stages in Fig. 2 may often be all that is required—ambiguities as a result of imprecise analysis at these early stages can be resolved in later stages on the basis of knowledge of the constraints at the morphological, syntactic, and semantic levels." (1962:158).
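The core loop of analysis-by-synthesis is worth setting out schematically. The sketch below is not Halle & Stevens' implementation (the 'synthesizer' is an arbitrary stand-in), but it shows the guess-synthesize-compare cycle, with goodness of fit deciding among hypotheses.

    import numpy as np

    rng = np.random.default_rng(1)

    def synthesize(params):
        # Stand-in synthesizer: any deterministic params -> signal mapping.
        amp, freq = params
        t = np.linspace(0.0, 1.0, 64)
        return amp * np.sin(2 * np.pi * freq * t)

    def analysis_by_synthesis(signal, candidates):
        # Keep the hypothesis whose synthetic output best matches the input.
        best, best_err = None, np.inf
        for params in candidates:
            err = np.sum((synthesize(params) - signal) ** 2)  # goodness of fit
            if err < best_err:
                best, best_err = params, err
        return best

    observed = synthesize((0.8, 5.0)) + rng.normal(scale=0.1, size=64)
    grid = [(a, f) for a in (0.4, 0.8, 1.2) for f in (3.0, 5.0, 7.0)]
    print(analysis_by_synthesis(observed, grid))  # recovers (0.8, 5.0)

In a fuller model the control component would order the candidate list adaptively, using earlier fits and phonotactic knowledge, rather than sweeping a fixed grid.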
In their discussion of the model, Halle & Stevens address the 'segmentation problem' and the many-to-one problem of matching input intensity-frequency-time patterns to stored memories of units (like words). Interestingly, such an exemplar-type approach seems to have been rejected in favor of generative rules more because of contemporary technical limitations on machine recognition than because of assumptions about the limits of human storage. Only two major premises seem questionable. First, it is stated that "[the speaker] does not exert precise control over such factors as the detailed characteristics of the source or the damping of the vocal tract." (1962:156). This may not be as self-evident now as it appeared to be in 1962, but a lot could hinge on what is meant by 'precise'. Second, the authors stress the necessity for a preliminary analysis before the phoneme identification stage in order to reduce "variance due to irrelevant factors" (1962:157). They appear to mean systematic (and unsystematic) allophonic variation. Recent evidence indicates that a great deal of (non-phonemic) linguistic information is systematically available from allophonic variation, including the phonetic fine detail associated with connected speech processes (cf. Hawkins, 2003; Local, 2003 and references therein). These points of view were presumably adopted because, as noted earlier, it was assumed that the phoneme is an obligatory processing unit during speech perception. However, the model as set out in 1962 would require very little change to be able to map the signal to a wider range of stored linguistic (and other) information than only to phonemes. As discussed below, it is useful to regard phonemes as primarily units of maximal phonological contrast for identification of lexical items in citation form, rather than as an obligatory first stage in understanding connected speech. A final point is that, like so many of these early papers, Halle & Stevens' (1962) work is unlikely to have been so impressive were it not for the interdisciplinary collaboration that has characterized much of the successful research in phonetics and speech communication ever since.

THE MIDDLE PERIOD: ABOUT 1965 TO 1995. 'THE SEARCH FOR "ESSENCE"'

By the mid-60s, a great deal was known about speech perception, but much of it was difficult to tie together. There was a strong push to impose order on the comparative chaos of the earlier period of discovery. Although there were many exceptions which unfortunately cannot be discussed in this relatively brief review, attention became focused on the non-linearity between variation in the acoustic signal and the perceptual response: in other words, upon accounting for categorical perception. Many experiments were conducted on English and, increasingly, other languages, to explore the types of speech sounds subject to categorical perception, and, in one way or another, the conditions under which it took place (e.g. Lisker & Abramson, 1964; Abramson & Lisker, 1967; Lisker & Abramson, 1967; Lisker, 1986; Miller & Liberman, 1979; Harnad, 1987). Because actual boundaries were shown to be sensitive to a wide range of stimulus conditions (cf. Repp & Liberman, 1987) and the push towards discovering essence was so compelling, there was a tendency to view context as variability, and to control for it ever more stringently.
At the same time, in order to discover the essential—or invariant—properties of a speech sound, it was necessary to develop a clear view of what is fundamental. It is only a slight exaggeration to say that during this period the basic syllable was identified by mutual consent as /ba/ (or possibly /da/, but not /ga/, which was too complicated). This syllable has the following properties: it is a CV, in isolation, and carries full stress. In typical experiments, the basic syllable was contrasted either with one or two other consonants, keeping the vowel constant, or with other vowels, keeping the consonant constant. Although this basic syllable seems implicitly to have been considered as having 'no context', it in fact has one: silence. But silence, of course, though by no means rare, is the least typical context in which to hear a syllable. In consequence, this period saw us sidelining, and in some cases completely losing sight of, many important aspects of speech, including polysyllables, unstressed syllables, prosody, accounting for rate changes, connected speech processes, the informativeness of systematic variation (especially in connected speech), meaning, and communication between individuals. The gains were that work could focus on the search for invariant correlates of speech sounds, the extension of attention to infants, animals, and new languages, and the development of theory. This section briefly discusses the two main theories developed during this period, the Motor Theory and Quantal Theory (leading to acoustic/auditory invariance). More extended discussions of these and other theories, as well as work on infants and animals, together with bibliographic references to the original sources, can be found in textbooks, e.g. Pickett (1999).

The Motor Theory of Speech Perception

In the original version of the Motor Theory of Speech Perception (Liberman et al., 1967), listeners were said to interpret speech sounds in terms of the motoric gestures they would use to make those same sounds. Liberman & Mattingly (1985) revised this postulate to the intended gestures of the speaker (rather than the listener) in light of a huge body of data gathered in large part by the researchers at Haskins Labs, and discussed in the 1985 paper. The Motor Theory includes a number of distinctive postulates, such as that speech processing is modular, and its processes unique to humans ('speech is special'). These and other principles are discussed extensively by Liberman & Mattingly (1985) and elsewhere, so need only be acknowledged here. For this review, it is sufficient to note that the proposed unit of perception is the speaker's intended gesture, and that identification of the intended gesture is synonymous with identifying a loosely-defined 'phonetic category' whose relationship to phoneme or phone is not clear.

The Quantal Theory of Speech Production and Perception

Quantal Theory, first outlined by Stevens (1972), and presented more completely by Stevens (1989), is well described by Stevens' paper in this volume, as well as elsewhere, e.g. Pickett (1999). Figure 6 illustrates the basic premise: that steady changes in any given articulatory parameter (e.g. distance of a constriction from the glottis, or cross-sectional area of a constriction at a given point in the vocal tract) result sometimes in a large change in the acoustic response, and sometimes in very little change; likewise for changes in acoustic parameters and the resultant auditory response.

Figure 6. Schematic illustration of the basic principle of Quantal Theory. (a), plain-font axes, shows changes in a relevant acoustic parameter as an articulatory parameter steadily changes. (b), italic-font axes, shows changes in a relevant auditory response as an acoustic parameter steadily changes. The pattern of response is the same in both cases.

The basic inventory of sounds of each language is chosen from these regions of acoustic or auditory stability, because they allow for relative imprecision in exactly how the sound—and percept—is achieved by the speaker. Some principles come from acoustic theory, while others (often more speculative) come from auditory physiology.
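The quantal premise of Figure 6 can be reproduced with any plateau-jump-plateau mapping. In the sketch below (the sigmoid and its threshold are invented, not Stevens' acoustic models), the 'stable regions' are simply the stretches where the derivative of the acoustic output with respect to the articulatory parameter is small.

    import numpy as np

    x = np.linspace(0.0, 1.0, 101)  # articulatory parameter (arbitrary units)
    acoustic = 1 / (1 + np.exp(-20 * (x - 0.5)))  # plateau - jump - plateau

    sensitivity = np.gradient(acoustic, x)  # acoustic change per unit articulation
    stable = x[sensitivity < 0.5]           # the quantal 'stable regions'
    print(f"stable regions cover {100 * stable.size / x.size:.0f}% of the range:")
    print(f"x <= {stable[stable < 0.5].max():.2f} or x >= {stable[stable > 0.5].min():.2f}")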
The units of perception are unambiguously defined in Quantal Theory as the phonological distinctive features of Chomsky & Halle (1968). These features are defined in articulatory terms, but Stevens began a systematic program to identify their acoustic or auditory correlates, and this quest culminated in acoustic/auditory invariance theory as a development of Quantal Theory. Acoustic invariance theory postulates that for each distinctive feature there is a binary response to an invariant acoustic or auditory property at a particular, crucial part of the signal. For consonants, these are typically particular types of change in spectral shape over short time periods across segment boundaries. For place of vowel articulation, they are typically places where two formants approach one another and remain in the same vicinity for some time (i.e. at vowel steady states). Importantly, immediate context is built into acoustic invariants, because the attributes that matter are those of one spectrum relative to another, at successive time intervals.

Figure 7. Above, spectrograms of bet and wet. Below, spectra taken at 10-ms intervals centred at the points indicated by the arrows in the spectrograms, illustrating acoustic correlates of [+consonantal], left, and [-consonantal], right.

Figures 7 and 8 illustrate the principles for the features [consonantal] and [strident], respectively. Sounds that are [+consonantal] are characterized by rapid spectral change at their segment boundaries, whereas [-consonantal] sounds are characterized by little spectral change at the equivalent places. This is illustrated in the lower panels of Figure 7, which each show three 26-ms lpc spectra taken at successive 10-ms intervals centered at the points indicated by the arrows in the spectrograms above: in bet, at and just after the release of /b/; and in wet, at and just after the point of maximum change in the frequency of F2 as the /w/ releases into the vowel. The feature [strident] is captured by the same principle of relational attributes, but the particular acoustic relationship is entirely different from that for [consonantal], as illustrated in Figure 8. When a sound is [+strident], its average spectrum at high frequencies (> c. 4 kHz) is higher in amplitude than the spectrum of the following vowel onset at the same frequencies. This can be seen in the left spectrum of Figure 8, for the /s/ and vowel onset of sing. The converse is the case for [-strident] sounds, whose high-frequency spectra are lower in amplitude than that of the following vowel onset, as shown at the right of Figure 8 for the spectra of the fricative and vowel onset of thing.

Figure 8. Above, spectrograms of sing and thing. Below, average spectra of the fricatives, as indicated by horizontal arrows in the spectrograms, and centered at vowel onset, indicated by vertical arrows. Acoustic correlates of [+strident], left, and [-strident], right, appear as the difference in amplitude at high frequencies.
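The relational criterion for [strident] translates directly into a measurement. The sketch below compares the average spectrum level above 4 kHz in a frication frame and a vowel-onset frame; the test signals are crude stand-ins (white noise for /s/-like frication, a low tone for the vowel), and the frame sizes are arbitrary.

    import numpy as np

    FS = 16000

    def band_level_db(frame, lo_hz=4000.0):
        # Average spectrum level above lo_hz, in dB (rectangular-window FFT).
        spec = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / FS)
        return 10 * np.log10(spec[freqs >= lo_hz].mean() + 1e-12)

    def is_strident(frication_frame, vowel_onset_frame):
        # [+strident] if the fricative's high-frequency level exceeds that of
        # the following vowel onset, as in the relational criterion above.
        return band_level_db(frication_frame) > band_level_db(vowel_onset_frame)

    rng = np.random.default_rng(2)
    frication = rng.normal(size=400)               # flat, noisy spectrum
    t = np.arange(400) / FS
    vowel = np.sin(2 * np.pi * 250.0 * t) + 0.01 * rng.normal(size=400)
    print(is_strident(frication, vowel))           # -> True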
The investigative program began with Stevens & Blumstein's (1978) landmark paper on distinctive spectral templates for identification of place of stop articulation, and has continued to develop ever since. Stevens (2002) offers a comprehensive recent account which includes some significant shifts in orientation. Connected speech has begun to be addressed, and a number of other desirable properties added, such as the definition and identification of landmarks as islands of perceptual reliability, along the lines suggested as desirable by Halle & Stevens (1962).

Common properties of Motor Theory and Quantal/Acoustic Invariance Theory

These two theories have traditionally been portrayed as incompatible with one another, and the literature abounds with often impassioned arguments for or against one or the other approach. Setting aside their obvious difference in motoric vs auditory basis for perceptual units, they in fact have much in common. They both specify units as dynamic (and therefore tied to speech production), and they both assume early abstraction from the physical sensation to the discrete units of phonological description.

Influences of speech perception theories on and from psycholinguistic theories

One consequence of the major phonetic theories' fundamental assumption of early abstraction to discrete phonological units was that it encouraged psycholinguistic theories of how speech is understood also to assume an input of abstract, discrete phonological units such as distinctive features or phonemes. That is, psycholinguistics was empowered to neglect phonetic information available from the physical signal. Although a lot of work was done on recognition of isolated words, the theoretical emphasis was not just on word identification, but also on word segmentation, which assumes connected speech. Various types of 'top-down knowledge' about, for example, metrical stress (combined with stress information carried by the signal), possible words, and phonotactic constraints, were introduced to compensate for the impoverished signal that a phonemic input provides. Much effort went into debating the point(s) in the recognition/access process at which top-down information plays a role, and whether it interacts with or is independent of the input signal; theories came to be defined in terms of their stand on these issues. As with phonetic theories of speech perception, elucidating how we understand meaning from a spoken utterance was sidelined in favor of elucidating how we identify more tractable units, in this case words; there was a tendency to assume that meaning was understood when the word was identified (which may be the case).
Major influences in this field of enquiry include TRACE (McClelland & Elman, 1986), the Cohort model (Marslen-Wilson, 1975, 1987; Marslen-Wilson & Welsh, 1978), and the series of studies and models (RACE, SHORTLIST, MERGE) due to Norris, Cutler and colleagues (Norris, 1994; Norris et al., 2000), although these last were published in the third rather than the middle period. For a representative collection of papers, see Marslen-Wilson (1989). However limited in scope these theories were during this middle period, they had a positive influence on thinking within phonetic research. They forced awareness of powerful non-phonological influences on speech perception, from word frequency to morphology to meaning; and, along with successful machine speech recognition systems based on HMMs and neural nets, they encouraged phonetic theories to replace binary distinctions with probabilities or the like, which encouraged an interest in speech perception as a probabilistic pattern-matching task. These changes would begin to be seen in the third period.

Extensions to and questions about the search for essence

The emphasis in mainstream phonetic research on the development of elegant theory involving early abstraction to phoneme-sized units never went unchallenged. Questions were raised from the outset. This section outlines just three, somewhat arbitrarily chosen, examples.

One question was essentially 'is simplicity the best answer?' When Quantal Theory was still relatively new, Klatt (1979) discussed his model of Lexical Access from Spectra (LAFS), in which the input would be matched against stored whole-word patterns. Klatt discussed the problems of how to represent the ends of words when they are so variable in connected speech, but did not provide a satisfactory answer. Had he lived long enough to benefit from recent knowledge, for example about the power of systematic fine phonetic detail to indicate linguistic structure in connected speech (Hawkins & Smith, 2001; Local, 2003 and references therein), this drawback might have been overcome. Likewise, shortly after Stevens & Blumstein began to propose spectral templates for acoustic invariants, Kewley-Port (1982) showed that consonant-vowel transitions in natural CV syllables often fail to follow the patterns that the Haskins experiments suggest they should if they are to act as cues to stop place of articulation, and later (1983) that CVs were identified better from a series of spectra taken at 5-ms intervals over the first 40 ms after the release of the stop than from just the two spectra, at burst and vowel onset, favored by invariance theorists. Evidently, more detail is helpful. This can be seen as 'more context'. Figure 9 shows the type of spectra Kewley-Port (1983) recommended.

Figure 9. 26-ms lpc spectra (Hanning window) taken at successive 5-ms intervals over the first 40 ms of the syllables /ba da ga/, as indicated. The first spectrum is centred on the stop burst.

Wider influences on phoneme identity were also noted quite early in this period. For example, Ganong (1980) showed shifts in the phoneme boundary when a VOT series has a word at one end and a non-word at the other. The shift is in favor of the word: thus more /d/s are heard in a dash-tash series and more /t/s in a task-dask series. Perception is more forgiving when the sound means something.
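Ganong's boundary shift is naturally expressed as a prior imposed on an otherwise fixed acoustic mapping. The toy Bayesian sketch below is an illustration of lexical bias, not Ganong's own analysis; the Gaussian likelihoods, VOT means and prior values are all invented.

    import numpy as np

    def p_voiced(vot_ms, prior_voiced=0.5, boundary=30.0, sd=8.0):
        # Posterior P(/d/ | VOT): Gaussian likelihoods, lexicon supplies the prior.
        like_d = np.exp(-0.5 * ((vot_ms - (boundary - 15.0)) / sd) ** 2)
        like_t = np.exp(-0.5 * ((vot_ms - (boundary + 15.0)) / sd) ** 2)
        num = prior_voiced * like_d
        return num / (num + (1.0 - prior_voiced) * like_t)

    vot = 30.0  # acoustically ambiguous token, midway between the two means
    print(p_voiced(vot, prior_voiced=0.75))  # dash-tash: 'dash' is the word -> more /d/
    print(p_voiced(vot, prior_voiced=0.25))  # task-dask: 'task' is the word -> more /t/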
Summary: achievements of the middle period

The ascent of theory from the mid-60s to the mid-90s was probably necessary in order to systematize knowledge and focus enquiry. The main types of theory developed relied heavily on linguistic theory, but (probably to their ultimate detriment) tended to neglect phonetic detail in favor of early abstraction, elegance and simplicity. Although theorists' initial intention seems to have been essentially to write a codebook for phoneme (or feature-bundle) identification, by the end of the period it was clear that this would not be possible in the foreseeable future. Moreover, there were many threads that were not easy to fit into the prevailing theories of speech perception but that were potentially compatible with more general theories of perception (e.g. Strange et al., 1983; Warren, 1970, 1999). Thus, by the mid-1990s, it was clear that many attributes of speech perception could not be accounted for by the main theories. Many people believed that focused debate about motoric vs acoustic representation was unproductive. It was time to broaden the field of enquiry again to include more natural speech in more natural (and wider) contexts, including connected speech processes, and to pay more attention to properties of linguistic units such as words that are more readily associated with meaning. Work along these lines had been in progress for some time (e.g. Pols, 1986), but had typically not received prominence in theoretical debate in the field. An important conclusion from this period was that, whatever the 'units' of speech perception are, they seem functionally inseparable from 'context'. The context and the current signal together determine whether the speech sounds coherent, and hence what each unit 'is'.

RECENT DEVELOPMENTS (SINCE ABOUT 1995)

It is too early to characterize the current period in a few words. However, since the early-to-mid 1990s, systematic subtle variation in fine phonetic detail has been receiving attention as potentially linguistically-informative, and efforts have been made to classify contexts that influence speech perception in more linguistically-sophisticated ways. Old and new themes have been combined, partly by re-examining and extending information provided by systematic phonetic variation, and partly by work in new areas. Three new areas are worth singling out: cross-linguistic work, memory and learning, and functional brain imaging. Work on the information conveyed by systematic fine phonetic detail and its role in perception also promises new insights.

New cross-linguistic work

Recent cross-linguistic work examines influences of native language structure on perception, including, in some cases, the influence of phonetic fine detail (e.g. Best, 1995; Beddor & Krakow, 1999; Beddor et al., 2001, 2002; Bradlow & Bent, 2002). This work shows that listeners' interpretation of a speech signal reflects complex interactions between their knowledge of vocal-tract dynamics and expectations derived from the structure of their native language. Bradlow (2002) makes similar points in an interactive framework reflecting Lindblom's (1990) H&H theory, which underlines both that the usual context for speech perception is conversation with another person, and the plasticity of phonetic categories.
New work on memory and learning

Recent work on memory and learning has examined the potential role of exemplar (or episodic) memory for particular words. In speech perception, this work stems largely from decades of work on memory by Pisoni and colleagues (cf. Goldinger, 1998). However, most of the experimental support is inexplicit about the influencing phonetic parameters, and in any case appears to contribute to identification of the speaker rather than of the meaning or linguistic structure. However, very recent work has demonstrated that listeners do use fine phonetic detail. Allen & Miller (2004) showed that listeners can identify the speaker from distinctions in VOT. Most pertinently, Smith (2004) showed that fine phonetic detail can facilitate identification of words in connected speech. Slightly inappropriate allophones in a sentence disrupted word-spotting only when the speaker was familiar to the listener, even though all speakers used the same regional accent. Importantly, familiarisation to such fine-grained attributes of a speaker's speech appears to be fast—a matter of minutes.

New work on brain function

Within the vast body of speech-related literature emerging from the young field of cognitive neuroscience, some functional brain imaging studies are beginning to look at patterns of speech processing without necessarily adopting the traditional view that speech perception takes place initially through early abstraction to phonological units. Coleman's (1998, 2002) reviews are particularly valuable in this respect. Representative experimental work examining evidence for various types of hierarchical processing includes Davis & Johnsrude (2003), Scott & Johnsrude (2003) and Scott & Wise (2003). Scott (2003) discusses this type of evidence with particular reference to the potential use of systematic fine phonetic detail in speech perception. Finally, Valaki et al. (2004) present fMRI data on recognition memory for words (from a clinical task used to establish cerebral dominance). Activation during this task was largely restricted to the left hemisphere for native speakers of Spanish and English, but most native speakers of Mandarin Chinese showed bilateral activity. Chinese is a tone language, and therefore the f0 contour contributes to lexical meaning. It has long been known that prosodic and musical functions tend to be right-lateralized. These data nicely confirm and extend Kimura's (1961b) early work on cerebral dominance for speech, and highlight interesting questions about the functional organization of prosodic and segmental information in speech perception.

Systematic fine phonetic detail

Re-emphasis on the linguistic and social context of an utterance, and re-examination of the detailed signal, is providing increasing evidence that the systematic fine phonetic detail of speech not only reflects many attributes of the linguistic structure of utterances, from syllabic constituent to morphological status to word frequency, but can also be used by listeners to help understand speech. This evidence is encouraging radical questions and suggestions about perceptual processing under normal, as opposed to laboratory, conditions (cf. Hawkins & Smith, 2001; Hawkins, 2003; Local, 2003; Pierrehumbert, 2002). It is encouraging that psycholinguists are showing an equally strong interest in fine phonetic detail (e.g. Gaskell & Marslen-Wilson, 1997, 2002; Davis, Marslen-Wilson & Gaskell, 2002; Mitterer & Blomert, 2003; Gow & McMurray, this volume).
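The shared machinery of exemplar accounts is a similarity-weighted vote over stored episodes. The sketch below is in the spirit of exemplar models such as Johnson's (1997), not a reimplementation of any of them: the two-dimensional 'phonetic detail' space, the category means, and the similarity constant are all invented.

    import numpy as np

    rng = np.random.default_rng(3)

    def exemplar_choice(token, exemplars, labels, c=2.0):
        # Every remembered token votes; nearer exemplars vote harder.
        d = np.linalg.norm(exemplars - token, axis=1)
        sim = np.exp(-c * d)
        support = {lab: sim[labels == lab].sum() for lab in np.unique(labels)}
        total = sum(support.values())
        return {lab: s / total for lab, s in support.items()}

    heed = rng.normal([2.0, 3.0], 0.3, size=(50, 2))  # stored tokens of 'heed'
    hid = rng.normal([2.6, 2.4], 0.3, size=(50, 2))   # stored tokens of 'hid'
    exemplars = np.vstack([heed, hid])
    labels = np.array(["heed"] * 50 + ["hid"] * 50)
    print(exemplar_choice(np.array([2.2, 2.8]), exemplars, labels))

Because every stored episode contributes, speaker-specific detail in the exemplar store shifts the classification of new tokens, which is one way of framing the familiarisation effects reported by Smith (2004).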
Gaskell & Marslen-Wilson, 1997, 2002; Davis, Marslen-Wilson & Gaskell, 2002; Mitterer & Blomert, 2003; Gow & McMurray, this volume).

What sort of model?
The present period has not yet produced a model with the elegance of those of the middle period that can also account easily for both old and new observations. It seems reasonable to hope that new theories will aim to include the following attributes. They should be biologically plausible; include roles for attention, memory, and learning; focus on understanding meaning rather than identifying phonological form; allow for multiple potential 'units of perception', possibly with no obligatory units; and allow meaning and linguistic structure to be understood from incomplete information. Each of these themes can be found in at least one theory from the early and middle periods. For example, Fowler's (1986) theory of Direct Perception is a more biologically-plausible version of the Motor Theory, if only because it is predicated on principles of perception in other modalities and claims that the basic processes involved are not special to human perception. Jakobson, Fant & Halle (1952) emphasized that we speak in order to be understood. Stevens' early work acknowledged the power of reliable but incomplete information, and his most recent work (Stevens, 2002) has developed this.

The sorts of phonetic models being proposed are too new to be fairly evaluated here. As a group, they are probabilistic and potentially make heavy use of fine phonetic detail (e.g. Johnson, 1997; Hawkins & Smith, 2001; Stevens, 2002; Hawkins, 2003; Pierrehumbert, 2002, 2003). None, however, has got very far in elucidating exactly how we move seemingly effortlessly from understanding a message, to recognizing that a sound pattern is or is not a word, to identifying a phoneme in a phoneme-recognition task. One promising approach may be a combination of Stevens' landmarks and relative invariance, enhanced features and analysis-by-synthesis (Stevens, 2002), with Grossberg's Adaptive Resonance Theory (e.g. Grossberg, 2003). As outlined in Hawkins & Smith (2001) and Hawkins (2003), such an approach might provide a biologically-plausible link between initial exemplar representation and a fully structured linguistic message and understanding. Speech rhythm seems likely to be fundamental in guiding perceptual decisions, and the perceived signal may be interpreted in terms of a rhythmically-based linguistic structure, such as some sort of prosodic tree. In neurobiological terms, this means that the cerebellum and its pathways to and from the cortex may be centrally involved in speech perception.

Two key questions
A key issue that needs addressing is what we mean by 'phonetic category'. Fowler (pers. comm., May 2004) acknowledges that motor theories are not clear on this point. Past and current research demonstrates that the mental representations of phonetic categories must be dynamic, relational, and plastic (Lindblom & Studdert-Kennedy, 1967; Repp & Liberman, 1987; Lively & Pisoni, 1997; Bradlow & Bent, 2002). Although this is accepted as fundamental in other areas of work on perception and cognition, it seems often to have been lost sight of for speech. New theoretical perspectives might do well to emphasize it. A second key issue is to re-evaluate the distinction between bottom-up and top-down information.
On the one hand, fine phonetic information that systematically indicates linguistic structure should make many of the 'top-down processes' postulated in models unnecessary. For example, fine allophonic detail can provide segmentation information that makes top-down use of abstract knowledge about possible-word constraints redundant. On the other hand, such fine phonetic detail cannot be used in the absence of top-down knowledge about how it should be used: for this language, this accent, this speaker. The traditional distinction between signal and knowledge is thus likely to be blurred in future models. This seems entirely consistent with current understanding of brain functioning.

A challenge
The ideas developed during the last decade are exciting and may represent a paradigm shift. The challenge in the next decade or two will be to define and refine new questions in testable ways: to refocus, but in ways that are rigorous yet centered on meaning and communication, that prevent the 'new understanding' from becoming doctrinaire, and that build on the considerable contributions of earlier work. This goal probably demands that we try to develop open-minded experimentation, and theory-building and testing, in parallel.

REMARKS ABOUT WHAT HAS BEEN OMITTED
This review has necessarily been selective. Topics that could and indeed should have been discussed include: speech perception by infants and animals; vowel perception (dynamics, center of gravity); prosody and its relationship with segmental phonetics; and sine-wave speech. Connections might usefully have been drawn between production and perception, with psychoacoustics, and with more aspects of memory, including associative memory and learning. Finally, a number of theories of speech perception that have contributed valuable data and insights were neglected, including direct perception, auditory enhancement, and FLMP. My apologies to the many researchers whose excellent work was also not mentioned.

ACKNOWLEDGEMENTS
I thank Tom Baer, Sharon Manuel, Fran Perler, Geoffrey Potter, Barbara Shinn-Cunningham, Janet Slifka and Rachel Smith for help in preparation of the oral and/or written versions of this paper. I am also deeply grateful to the people who kindly and enthusiastically told me which papers and issues in the last fifty years of speech perception they considered the most important and influential, and why. They are: Cathi Best, Ann Bradlow, John Coleman, Martin Cooke, Wim van Dommelen, Steve Grossberg, Sharon Manuel, Noël Nguyen, Janet Pierrehumbert, David Pisoni, Robert Remez, Arty Samuel, Betty Tuller and Doug Whalen. None referred to their own work, though I have chosen to refer to some of it. For better or worse, all decisions about what to include and what slant to give the paper were my own. Supported in part by a Major Research Fellowship from the Leverhulme Trust.

REFERENCES
Abramson, A. & Lisker, L. (1967) Discrimination along the voicing continuum: Cross-language tests, Status Report on Speech Research, Haskins Laboratories, 17-22.
Allen, J.S. & Miller, J.L. (2004) Listener sensitivity to individual talker differences in voice-onset-time, Journal of the Acoustical Society of America, 116, 3171-3183.
Beddor, P.S., Krakow, R.A. & Lindemann, S. (2001) Patterns of perceptual compensation and their phonological consequences, In The Role of Perceptual Phenomena in Phonology (edited by E. Hume and K. Johnson), San Diego: Academic Press, 55-78.
Beddor, P.S., Harnsberger, J. & Lindemann, S. (2002) Language-specific patterns of vowel-to-vowel coarticulation: acoustic structures and their perceptual correlates, Journal of Phonetics, 30, 591-627.
Best, C.T. (1995) A direct-realist view of cross-language speech perception, In Speech Perception and Linguistic Experience: Issues in Cross-Language Speech Research (edited by W. Strange), Baltimore: York Press, 171-206.
Bradlow, A. (2002) Confluent talker- and listener-oriented forces in clear speech production, In Laboratory Phonology VII (Phonology and Phonetics) (edited by C. Gussenhoven & N. Warner), Berlin: Mouton de Gruyter, 241-273.
Bradlow, A. & Bent, T. (2002) The clear speech effect for non-native listeners, Journal of the Acoustical Society of America, 112, 272-284.
Bregman, A.S. (1990) Auditory Scene Analysis: The Perceptual Organization of Sound, Cambridge, MA: MIT Press.
Broadbent, D.E. & Ladefoged, P. (1957) On the fusion of sounds reaching different sense organs, Journal of the Acoustical Society of America, 29, 708-710.
Brokx, J.P.L. & Nooteboom, S.G. (1982) Intonation and the perceptual separation of simultaneous voices, Journal of Phonetics, 10, 23-36.
Carlyon, R.P., Deeks, J.M., Norris, D. & Butterfield, S. (2002) The continuity illusion and vowel identification, Acta Acustica united with Acustica, 88, 408-415.
Cherry, E.C. (1953) Some experiments on the recognition of speech, with one and two ears, Journal of the Acoustical Society of America, 25, 975-979.
Chomsky, N. & Halle, M. (1968) The Sound Pattern of English, New York: Harper and Row.
Coleman, J.S. (1998) Cognitive reality and the phonological lexicon: A review, Journal of Neurolinguistics, 11, 295-320.
Coleman, J. (2002) Phonetic representations in the mental lexicon, In Phonetics, Phonology and Cognition (edited by J. Durand and B. Laks), Oxford: Oxford University Press, 96-130.
Cooke, M.P. & Ellis, D.P.W. (2001) The auditory organization of speech and other sources in listeners and computational models, Speech Communication, 35, 141-177.
Cooper, F.S., Delattre, P.C., Liberman, A.M., Borst, J.M. & Gerstman, L.J. (1952) Some experiments on the perception of synthetic speech sounds, Journal of the Acoustical Society of America, 24, 597-606.
Darwin, C.J. (1984) Perceiving vowels in the presence of another sound: Constraints on formant perception, Journal of the Acoustical Society of America, 76, 1636-1647.
Darwin, C.J. (1997) Auditory grouping, Trends in Cognitive Sciences, 1, 327-333.
Davis, M.H. & Johnsrude, I.S. (2003) Hierarchical processing in spoken language comprehension, Journal of Neuroscience, 23, 3423-3431.
Davis, M.H., Marslen-Wilson, W.D. & Gaskell, M.G. (2002) Leading up the lexical garden-path: segmentation and ambiguity in spoken word recognition, Journal of Experimental Psychology: Human Perception and Performance, 28, 218-244.
Delattre, P.C., Liberman, A.M. & Cooper, F.S. (1955) Acoustic loci and transitional cues for consonants, Journal of the Acoustical Society of America, 27, 769-773.
Fowler, C.A. (1986) An event approach to the study of speech perception from a direct-realist perspective, Journal of Phonetics, 14, 3-28.
Ganong, W.F. (1980) Phonetic categorization in auditory word perception, Journal of Experimental Psychology: Human Perception and Performance, 6, 110-125.
Gaskell, M.G. & Marslen-Wilson, W.D. (1997) Integrating form and meaning: A distributed model of speech perception, Language and Cognitive Processes, 12, 613-656.
Gaskell, M.G. & Marslen-Wilson, W.D. (2002) Representation and competition in the perception of spoken words, Cognitive Psychology, 45, 220-266.
Goldinger, S.D. (1998) Echoes of echoes? An episodic theory of lexical access, Psychological Review, 105, 251-279.
Goldinger, S.D. & Azuma, T. (2003) Puzzle-solving science: the quixotic quest for units in speech perception, Journal of Phonetics, 31, 305-320.
Gow, D. & McMurray, B. (this volume) From sound to sense and back again: The integration of lexical and speech processes.
Grossberg, S. (2003) Resonant neural dynamics of speech perception, Journal of Phonetics, 31, 423-445.
Halle, M. & Stevens, K. (1962) Speech recognition: A model and a program for research, IRE Transactions on Information Theory, IT-8, 155-159.
Harnad, S. (1987) Categorical Perception: The Groundwork of Cognition, Cambridge: Cambridge University Press.
Hawkins, S. (2003) Roles and representations of systematic fine phonetic detail in speech understanding, Journal of Phonetics, 31, 373-405.
Hawkins, S. & Smith, R. (2001) Polysp: A polysystemic, phonetically-rich approach to speech understanding, Italian Journal of Linguistics – Rivista di Linguistica, 13, 99-188. http://kiri.ling.cam.ac.uk/sarah/TIPS/hawkins-smith-01.pdf
Jakobson, R., Fant, C.G.M. & Halle, M. (1952) Preliminaries to Speech Analysis: The Distinctive Features and their Correlates, Acoustics Laboratory Technical Report 13, Massachusetts Institute of Technology, Cambridge, MA; reprinted by MIT Press, Cambridge, MA, 1967.
Johnson, K. (1997) Speech perception without speaker normalization: An exemplar model, In Talker Variability in Speech Processing (edited by K. Johnson and J.W. Mullennix), San Diego: Academic Press, 145-165.
Keyser, S.J. & Stevens, K.N. (1994) Feature geometry and the vocal tract, Phonology, 11, 207-236.
Keyser, S.J. & Stevens, K.N. (2001) Enhancement revisited, In Ken Hale: A Life in Language (edited by M. Kenstowicz), Cambridge, MA: MIT Press.
Kewley-Port, D. (1982) Measurement of formant transitions in naturally produced stop consonant-vowel syllables, Journal of the Acoustical Society of America, 72, 379-389.
Kewley-Port, D. (1983) Time-varying features as correlates of place of articulation of stop consonants, Journal of the Acoustical Society of America, 73, 322-335.
Kimura, D. (1961a) Some effects of temporal-lobe damage on auditory perception, Canadian Journal of Psychology, 15, 156-165.
Kimura, D. (1961b) Cerebral dominance and the perception of verbal stimuli, Canadian Journal of Psychology, 15, 166-171.
Klatt, D.H. (1979) Speech perception: A model of acoustic-phonetic analysis and lexical access, Journal of Phonetics, 7, 279-312.
Ladefoged, P. (1987) A note on "Information conveyed by vowels", Journal of the Acoustical Society of America, 85, 2223-2224.
Ladefoged, P. & Broadbent, D.E. (1957) Information conveyed by vowels, Journal of the Acoustical Society of America, 29, 98-104.
Liberman, A.M., Cooper, F.S., Shankweiler, D.P. & Studdert-Kennedy, M. (1967) Perception of the speech code, Psychological Review, 74, 431-461.
Liberman, A.M., Harris, K.S., Hoffman, H.S. & Griffith, B.C. (1957) The discrimination of speech sounds within and across phoneme boundaries, Journal of Experimental Psychology, 54, 358-368.
Liberman, A.M. & Mattingly, I.G. (1985) The motor theory of speech perception revised, Cognition, 21, 1-36.
Lindblom, B. (1990) Explaining phonetic variation: A sketch of the H&H theory, In Speech Production and Speech Modelling (edited by W.J. Hardcastle and A. Marchal), The Netherlands: Kluwer Academic, 403-439.
Lindblom, B. & Studdert-Kennedy, M. (1967) On the role of formant transitions in vowel recognition, Journal of the Acoustical Society of America, 42, 830-843.
Lisker, L. (1986) "Voicing" in English: A catalogue of acoustic features signalling /b/ versus /p/ in trochees, Language and Speech, 29, 3-11.
Lisker, L. & Abramson, A. (1964) A cross-language study of voicing in initial stops: Acoustical measurements, Word, 20, 384-422.
Lisker, L. & Abramson, A. (1967) Some effects of context on voice onset time in English, Language and Speech, 10, 1-28.
Lively, S.E. & Pisoni, D.B. (1997) On prototypes and phonetic categories: A critical assessment of the perceptual magnet effect in speech perception, Journal of Experimental Psychology: Human Perception and Performance, 23, 1665-1679.
Local, J. (2003) Variable domains and variable relevance: Interpreting phonetic exponents, Journal of Phonetics, 31, 321-339.
Luce, P.A. & Pisoni, D.B. (1998) Recognizing spoken words: the neighborhood activation model, Ear and Hearing, 19, 1-36.
Luce, P., Pisoni, D.B. & Goldinger, S. (1990) Similarity neighborhoods of spoken words, In Cognitive Models of Speech Perception: Psycholinguistic and Computational Perspectives (edited by G. Altmann), Cambridge, MA: MIT Press, 122-147.
Marslen-Wilson, W.D. (1975) Sentence perception as an interactive parallel process, Science, 189, 226-228.
Marslen-Wilson, W.D. (1987) Functional parallelism in spoken word recognition, Cognition, 25, 71-102.
Marslen-Wilson, W. (Ed.) (1989) Lexical Representation and Process, Cambridge, MA: MIT Press.
Marslen-Wilson, W.D. & Welsh, A. (1978) Processing interactions and lexical access during word recognition in continuous speech, Cognitive Psychology, 10, 29-63.
Massaro, D. (1998) Perceiving Talking Faces: From Speech Perception to a Behavioral Principle, Cambridge, MA: MIT Press.
McClelland, J.L. & Elman, J.L. (1986) The TRACE model of speech perception, Cognitive Psychology, 18, 1-86.
McNeill, D. & Lindig, K. (1973) The perceptual reality of phonemes, syllables, words, and sentences, Journal of Verbal Learning and Verbal Behavior, 12, 419-430.
Miller, G.A. (1956) The magical number seven, plus or minus two: Some limits on our capacity for processing information, Psychological Review, 63, 81-97.
Miller, G.A., Heise, G.A. & Lichten, W. (1951) The intelligibility of speech as a function of the context of the test materials, Journal of Experimental Psychology, 41, 329-335.
Miller, J.L. & Liberman, A.M. (1979) Some effects of later-occurring information on the perception of stop consonant and semivowel, Perception and Psychophysics, 25, 457-465.
Mitterer, H. & Blomert, L. (2003) Coping with phonological assimilation in speech perception: Evidence for early compensation, Perception and Psychophysics, 65, 956-969.
Norris, D.G. (1994) SHORTLIST: A hybrid connectionist model of continuous speech recognition, Cognition, 52, 189-234.
Norris, D.G., McQueen, J.M. & Cutler, A. (2000) Merging information in speech recognition: Feedback is never necessary, Behavioral and Brain Sciences, 23, 299-370.
Pickett, J.M. (1999) The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology, Boston: Allyn and Bacon.
Pickett, J.M. & Pollack, I. (1963) Intelligibility of excerpts from fluent speech: Effects of rate of utterance and duration of excerpt, Language and Speech, 6, 151-164.
Pierrehumbert, J. (2002) Word-specific phonetics, In Laboratory Phonology VII (Phonology and Phonetics) (edited by C. Gussenhoven & N. Warner), Berlin: Mouton de Gruyter, 101-140.
Pierrehumbert, J. (2003) Probabilistic phonology: Discrimination and robustness, In Probability Theory in Linguistics (edited by R. Bod, J. Hay & S. Jannedy), Cambridge, MA: MIT Press.
Pisoni, D.B. (1973) Auditory and phonetic memory codes in the discrimination of consonants and vowels, Perception and Psychophysics, 13, 253-260.
Pols, L.C.W. (1986) Variation and interaction in speech, In Invariance and Variability in Speech Processes (edited by J.S. Perkell & D.H. Klatt), Hillsdale, NJ: Lawrence Erlbaum Associates, 140-154.
Repp, B.H. & Liberman, A.M. (1987) Phonetic category boundaries are flexible, In Categorical Perception: The Groundwork of Cognition (edited by S. Harnad), Cambridge: Cambridge University Press, 89-112.
Savin, H.B. & Bever, T.G. (1970) The nonperceptual reality of the phoneme, Journal of Verbal Learning and Verbal Behavior, 9, 295-302.
Scott, S.K. (2003) How might we conceptualize speech perception? The view from neurobiology, Journal of Phonetics, 31, 417-422.
Scott, S.K. & Johnsrude, I.S. (2003) The neuroanatomical and functional organization of speech perception, Trends in Neurosciences, 26, 100-107.
Scott, S.K. & Wise, R.J.S. (2003) PET and fMRI studies of the neural basis of speech perception, Speech Communication, 41, 23-34.
Smith, R. (2004) The Role of Fine Phonetic Detail in Word Segmentation, PhD Dissertation, Department of Linguistics, Cambridge University.
Stevens, K.N. (1972) The quantal nature of speech: Evidence from articulatory-acoustic data, In Human Communication: A Unified View (edited by E.E. David and P.B. Denes), New York: McGraw-Hill, 51-66.
Stevens, K.N. (1989) On the quantal nature of speech, Journal of Phonetics, 17, 3-45.
Stevens, K.N. (2002) Toward a model for lexical access based on acoustic landmarks and distinctive features, Journal of the Acoustical Society of America, 111, 1872-1891.
Stevens, K.N. & Blumstein, S.E. (1978) Invariant cues for place of articulation in stop consonants, Journal of the Acoustical Society of America, 64, 1358-1368.
Stevens, K.N. & Halle, M. (1967) Remarks on analysis by synthesis and distinctive features, In Models for the Perception of Speech and Visual Form (edited by W. Wathen-Dunn), Cambridge, MA: MIT Press, 88-102.
Strange, W., Jenkins, J.J. & Johnson, T.L. (1983) Dynamic specification of coarticulated vowels, Journal of the Acoustical Society of America, 74, 695-705.
Sumby, W.H. & Pollack, I. (1954) Visual contribution to speech intelligibility in noise, Journal of the Acoustical Society of America, 26, 212-215.
Sussman, H.M., Fruchter, D., Hilbert, J. & Sirosh, J. (1998) Linear correlates in the speech signal: The orderly output constraint, Behavioral and Brain Sciences, 21, 241-299.
Underwood, B.J. (1957) Interference and forgetting, Psychological Review, 64, 49-60.
Underwood, B.J. (1982) Studies in Human Learning and Memory: Selected Papers, New York: Praeger.
Valaki, C.E., Maestu, F., Simos, P.G., Zhang, W., Fernandez, A., Amo, C.M., Ortiz, T.M. & Papanicolaou, A.C. (2004) Cortical organization for receptive language functions in Chinese, English, and Spanish: a cross-linguistic MEG study, Neuropsychologia, 42, 967-979.
Warren, R.M. (1970) Perceptual restoration of missing speech sounds, Science, 167, 392-393.
Warren, R.M. (1999) Auditory Perception: A New Analysis and Synthesis, Cambridge: Cambridge University Press.